Vehicle insurance (also known as car insurance, motor insurance, or auto insurance) is insurance for cars, trucks, motorcycles, and other road vehicles. Its primary use is to provide financial protection against physical damage or bodily injury resulting from traffic collisions and against liability that could also arise from incidents in a vehicle. Vehicle insurance may additionally offer financial protection against theft of the vehicle, and against damage to the vehicle sustained from events other than traffic collisions, such as keying, weather or natural disasters, and damage sustained by colliding with stationary objects. The specific terms of vehicle insurance vary with legal regulations in each region.
Reference: https://en.wikipedia.org/wiki/Vehicle_insurance
Our goal is to build a model from health insurance customer data to predict whether customers are interested in purchasing a vehicle insurance policy. https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction
import sys
import warnings
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
if not sys.warnoptions:
    warnings.simplefilter("ignore")
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, auc, roc_curve
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, learning_curve, cross_validate, train_test_split, KFold
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import plotly.express as px
def count_plot(df, feat, palette='rainbow'):
    sns.set_style('whitegrid')
    labels = df[feat].value_counts().index
    values = df[feat].value_counts().values
    plt.figure(figsize=(15, 5))
    ax = plt.subplot2grid((1, 2), (0, 0))
    sns.barplot(x=labels, y=values, palette=palette, alpha=0.75)
    # annotate each bar with its count
    for i, p in enumerate(ax.patches):
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2., height + 0.1, values[i], ha="center")
    plt.title(f'Distribution of {feat}', fontsize=15, weight='bold')
    plt.show()
Response is our target variable: 1 means the customer is interested in vehicle insurance, 0 means not interested. So this is a binary classification task. Looking at the target distribution, it's clear that the labels are imbalanced. We can try to up- or down-sample the data to improve model performance.
Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented). These terms are used both in statistical sampling, survey design methodology and in machine learning.
Reference: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
count_plot(data,'Response')
Gender distribution in data looks balanced.
count_plot(data,'Gender','Purples')
plt.show()
Let's see whether Age has any effect on the Response target variable.
Some age ranges show more interest in vehicle insurance, so it will be better to group ages according to the distribution above.
bins = [20, 30, 40, 50, 60, 70, 80,90]
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79','80+']
data['AgeClass']=pd.cut(data.Age, bins, labels = labels,include_lowest = True)
test_df['AgeClass']=pd.cut(test_df.Age, bins, labels = labels,include_lowest = True)
data[['Age','AgeClass']].head(5)
Older people tend to have more vehicle damage.
with sns.axes_style(style='ticks'):
    # factorplot was renamed catplot in newer seaborn versions
    g = sns.catplot(x="Vehicle_Damage", y="Age", hue="Gender", data=data, kind="box")
    g.set_axis_labels("Vehicle_Damage", "Age")
Let's decide which features are categorical and numeric. This will be later used for encoding purposes.
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. Reference: https://en.wikipedia.org/wiki/Outlier
The interquartile range (IQR) is often used to find outliers in data. Outliers here are defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. In a boxplot, the highest and lowest occurring value within this limit are indicated by whiskers of the box (frequently with an additional bar at the end of the whisker) and any outliers as individual points. Reference: https://en.wikipedia.org/wiki/Interquartile_range#Outliers
def detect_outliers(df, feat):
    Q1 = df[feat].quantile(0.25)
    Q3 = df[feat].quantile(0.75)
    IQR = Q3 - Q1
    # count observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    return df[(df[feat] < (Q1 - 1.5 * IQR)) | (df[feat] > (Q3 + 1.5 * IQR))].shape[0]

def clean_outliers(df, feat):
    Q1 = df[feat].quantile(0.25)
    Q3 = df[feat].quantile(0.75)
    IQR = Q3 - Q1
    # keep only observations inside the whisker limits
    return df[~((df[feat] < (Q1 - 1.5 * IQR)) | (df[feat] > (Q3 + 1.5 * IQR)))]
clean_data=clean_outliers(data,'Annual_Premium')
clean_data.shape
We split the data into a 33% test set and use the rest for training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(clean_data[data_cats+data_nums], clean_data.Response, test_size=0.33, random_state=1)
OrdinalEncoder/LabelEncoder: when order matters for a categorical variable (e.g. cold, warm, hot), use sklearn's OrdinalEncoder or LabelEncoder.
One-Hot Encoding: when order does NOT matter (e.g. Gender: Female, Male), we can use sklearn's OneHotEncoder or pandas' get_dummies function.
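The two strategies can be contrasted on a tiny hypothetical frame (the `Temp` column and its ordering are made up for illustration; only `Gender` resembles the actual data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'Temp': ['cold', 'hot', 'warm'],
                    'Gender': ['Female', 'Male', 'Female']})

# Ordered category: pass an explicit category order so codes respect it
oe = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
toy['Temp_enc'] = oe.fit_transform(toy[['Temp']])  # cold=0, warm=1, hot=2

# Unordered category: one binary column per level
dummies = pd.get_dummies(toy['Gender'], prefix='Gender')
print(toy['Temp_enc'].tolist())   # [0.0, 2.0, 1.0]
print(list(dummies.columns))      # ['Gender_Female', 'Gender_Male']
```

Ordinal codes keep a single column but impose a ranking; one-hot columns avoid a spurious ranking at the cost of wider data.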
There are two Policy_Sales_Channel values in the test data, 141 and 142, that do not exist in the train data. Since only a couple of rows are affected, we replace them with 140.
def prepare_inputs(train):
    oe = OrdinalEncoder()
    oe.fit(train)
    return oe
oe=prepare_inputs(data[data_cats])
X_train_enc=oe.transform(X_train[data_cats])
X_test_enc=oe.transform(X_test[data_cats])
# there is 2 unknown new Policy_Sales_Channel values in test 141 and 142
# we replace them with 140
test_df.loc[test_df['Policy_Sales_Channel']==141.0, 'Policy_Sales_Channel']=140.0
test_df.loc[test_df['Policy_Sales_Channel']==142.0, 'Policy_Sales_Channel']=140.0
test_df_enc=oe.transform(test_df[data_cats])
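As an alternative to manually remapping the unseen channels, scikit-learn 0.24+ lets OrdinalEncoder absorb unknown categories directly; a minimal sketch with made-up channel values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_vals = np.array([[26.0], [140.0], [152.0]])   # categories seen in training
test_vals = np.array([[140.0], [141.0], [142.0]])   # 141 and 142 are unseen

# Map any category not seen during fit to a sentinel value instead of raising
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_vals)
encoded = oe.transform(test_vals)
print(encoded.ravel())  # [ 1. -1. -1.]
```

This avoids editing the test frame, though the sentinel -1 then needs a sensible interpretation downstream.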
all_train_enc=np.concatenate((X_train_enc, X_train[data_nums].values), axis=1)
all_test_enc=np.concatenate((X_test_enc, X_test[data_nums].values), axis=1)
all_test_df_enc=np.concatenate((test_df_enc, test_df[data_nums].values), axis=1)
SelectKBest score functions:
For Regression: f_regression, mutual_info_regression
For Classification: chi2, f_classif, mutual_info_classif
Chi2 is generally for categorical variables. We use mutual_info_classif, which is suitable for mixed variables, not just categorical or numerical ones.
Here we see that adding age groups as a new feature does not bring any improvement: Age and AgeClass have the same feature importance.
Depending on the k-scores we can drop some non-useful features from the dataset. For example, Vintage has the lowest k-score here, so we may drop it if we want.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif
# chi2 for categorical variables
# mutual_info_classif for mixed variables
fs = SelectKBest(score_func=mutual_info_classif, k='all')
fs.fit(all_train_enc, y_train)
X_train_fs = fs.transform(all_train_enc)
for i in range(len(fs.scores_)):
    print('%s: %f' % (data_all[i], fs.scores_[i]))
plt.figure(figsize=(18,8))
sns.barplot(x=data_all, y=fs.scores_)
plt.title('Categorical Feature Selection with mutual_info_classif')
plt.show()
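Following the note above about dropping low-scoring features, passing a numeric `k` instead of `'all'` makes SelectKBest discard the weakest columns itself. A toy sketch with synthetic data (f_classif is swapped in for mutual_info_classif here only because it is fast on small arrays):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
informative = y + rng.normal(scale=0.1, size=200)  # strongly tied to the label
noise = rng.normal(size=(200, 3))                  # unrelated columns
X = np.column_stack([informative, noise])

# Keep only the single best-scoring column
fs = SelectKBest(score_func=f_classif, k=1)
X_selected = fs.fit_transform(X, y)
print(fs.get_support())  # [ True False False False]
```

`get_support()` reports which columns survived, so the same mask can be applied to a held-out test set via `fs.transform`.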
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import ADASYN
#ros = RandomOverSampler(random_state=42, sampling_strategy='minority')
#all_train_enc_over_sampled, y_train_over_sampled = ros.fit_resample(all_train_enc, y_train)
ada = ADASYN(random_state=42)
all_train_enc_over_sampled, y_train_over_sampled = ada.fit_resample(all_train_enc, y_train)
y_train=y_train_over_sampled
import plotly.express as px
from sklearn.decomposition import PCA
n_components = 2
pca = PCA(n_components=n_components)
components = pca.fit_transform(all_train_enc_over_sampled)
total_var = pca.explained_variance_ratio_.sum() * 100
fig = px.scatter(components, x=0, y=1, color=y_train, title=f'Total Explained Variance: {total_var:.2f}%',)
fig.show()